Abstract

We derive an equation for temporal difference learning from statistical
principles. Specifically, we start with the variational principle and then
bootstrap to produce an updating rule for discounted state value estimates. The
resulting equation is similar to the standard equation for temporal difference
learning with eligibility traces, so-called TD(λ); however, it lacks the
parameter α that specifies the learning rate. In the place of this free parameter
there is now an equation for the learning rate that is specific to each state
transition. We experimentally test this new learning rule against TD(λ) and
find that it offers superior performance in various settings. Finally, we make
some preliminary investigations into how to extend our new temporal difference
algorithm to reinforcement learning. To do this we combine our update
equation with both Watkins' Q(λ) and Sarsa(λ) and find that it again offers
superior performance without a learning rate parameter.
1 Introduction



In the field of reinforcement learning, perhaps the most popular way to estimate
the future discounted reward of states is the method of temporal difference learning.
It is unclear who exactly introduced this first; however, the first explicit version of
temporal difference as a learning rule appears to be Witten [Wit77]. The idea is as
follows: The expected future discounted reward of a state $s$ is,

$$V_s := E\{ r_k + \gamma r_{k+1} + \gamma^2 r_{k+2} + \cdots \mid s_k = s \},$$

where the rewards $r_k, r_{k+1}, \ldots$ are geometrically discounted into the future
by $\gamma < 1$. From this definition it follows that,

$$V_s = E\{ r_k + \gamma V_{s_{k+1}} \mid s_k = s \}. \tag{1}$$

Our task, at time $t$, is to compute an estimate $V^t_s$ of $V_s$ for each state $s$. The
only information we have to base this estimate on is the current history of state
transitions, $s_1, s_2, \ldots, s_t$, and the current history of observed rewards,
$r_1, r_2, \ldots, r_t$. Equation (1) suggests that at time $t+1$ the value of
$r_t + \gamma V_{s_{t+1}}$ provides us with information on what $V_{s_t}$ should be: If it
is higher than $V^t_{s_t}$ then perhaps this estimate should be increased, and vice versa.
This intuition gives us the following estimation heuristic for state $s_t$,

$$V^{t+1}_{s_t} := V^t_{s_t} + \alpha \left( r_t + \gamma V^t_{s_{t+1}} - V^t_{s_t} \right),$$

where α is a parameter that controls the rate of learning. This type of temporal
difference learning is known as TD(0).
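
As a minimal sketch of the TD(0) heuristic just described, the update for a single
observed transition can be written as below. The function name `td0_update` and the
list-based value table are illustrative assumptions, not part of the paper.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update to the value table V in place; return the TD error."""
    td_error = r + gamma * V[s_next] - V[s]   # how far V[s] is from its one-step target
    V[s] += alpha * td_error                  # move V[s] a fraction alpha towards it
    return td_error

# Hypothetical three-state example: a reward of 1 is seen when leaving state 0.
V = [0.0, 0.0, 0.0]
td0_update(V, s=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
```

Only the entry for the state that was just left changes; all other estimates stay
as they were.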

One shortcoming of this method is that at each time step the value of only the
last state $s_t$ is updated. States before the last state are also affected by changes in
the last state's value and thus these could be updated too. This is what happens
with so-called temporal difference learning with eligibility traces, where a history, or
trace, is kept of which states have been recently visited. Under this method, when
we update the value of a state we also go back through the trace updating the earlier
states as well. Formally, for any state $s$ its eligibility trace is computed by,

$$E^t_s := \begin{cases} \gamma\lambda E^{t-1}_s & \text{if } s \neq s_t, \\ \gamma\lambda E^{t-1}_s + 1 & \text{if } s = s_t, \end{cases}$$

where λ is used to control the rate at which the eligibility trace is discounted. The
temporal difference update is then, for all states $s$,

$$V^{t+1}_s := V^t_s + \alpha E^t_s \left( r_t + \gamma V^t_{s_{t+1}} - V^t_{s_t} \right). \tag{2}$$

This more powerful version of temporal difference learning is known as TD(λ) [Sut88].
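
The trace recursion and Equation (2) can be sketched together as follows. This is an
illustration under the assumption of a small tabular problem stored in NumPy arrays;
the name `td_lambda_update` is hypothetical.

```python
import numpy as np

def td_lambda_update(V, E, s, r, s_next, alpha, gamma, lam):
    """One TD(lambda) step with accumulating traces; V and E are updated in place."""
    E *= gamma * lam                 # decay the trace of every state
    E[s] += 1.0                      # add 1 to the trace of the state visited at time t
    td_error = r + gamma * V[s_next] - V[s]
    V += alpha * E * td_error        # Equation (2), applied to all states at once
    return td_error

V = np.zeros(3)
E = np.zeros(3)
td_lambda_update(V, E, s=0, r=0.0, s_next=1, alpha=0.5, gamma=0.9, lam=0.8)
td_lambda_update(V, E, s=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9, lam=0.8)
```

In the second call the reward observed when leaving state 1 also raises the value of
state 0, because state 0 still carries a trace of 0.9 × 0.8 = 0.72.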

The main idea of this paper is to derive a temporal difference rule from statistical
principles and compare it to the standard heuristic described above. Superficially,
our work has some similarities to LSTD(λ) ([LP03] and references therein). However,
LSTD is concerned with finding a least-squares linear function approximation, it has
not yet been developed for general γ and λ, and has update time quadratic in the
number of features/states. On the other hand, our algorithm "exactly" coincides
with TD/Q/Sarsa(λ) for finite state spaces, but with a novel learning rate derived
from statistical principles. We therefore focus our comparison on TD/Q/Sarsa. For
a recent survey of methods to set the learning rate see [GP06].

In Section 2 we derive a least squares estimate for the value function. By
expressing the estimate as an incremental update rule we obtain a new form of TD(λ),
which we call HL(λ). In Section 3 we compare HL(λ) to TD(λ) on a simple Markov
chain. We then test it on a random Markov chain in Section 4 and a non-stationary
environment in Section 5. In Section 6 we derive two new methods for policy
learning based on HL(λ), and compare them to Sarsa(λ) and Watkins' Q(λ) on a simple
reinforcement learning problem. Section 7 ends the paper with a summary and some
thoughts on future research directions.



2 Derivation 

The empirical future discounted reward of a state $s_k$ is the sum of actual rewards
following from state $s_k$ in time steps $k, k+1, \ldots$, where the rewards are discounted
as they go into the future. Formally, the empirical value of state $s_k$ at time $k$ for
$k = 1, \ldots, t$ is,

$$v_k := \sum_{u=k}^{\infty} \gamma^{u-k} r_u, \tag{3}$$

where the future rewards $r_u$ are geometrically discounted by $\gamma < 1$. In practice
the exact value of $v_k$ is always unknown to us as it depends not only on rewards
that have been already observed, but also on unknown future rewards. Note that if
$s_m = s_n$ for $m \neq n$, that is, we have visited the same state twice at different times
$m$ and $n$, this does not imply that $v_n = v_m$ as the observed rewards following the
state visit may be different each time.
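
To make the last remark concrete, the sketch below computes empirical values from a
short reward sequence. It assumes the infinite sum in Equation (3) is truncated at the
last observed reward, which is a simplification for illustration only; the function
name and the toy data are hypothetical.

```python
def empirical_values(rewards, gamma):
    """Return v_k for every k, truncating the sum at the last observed reward."""
    values = [0.0] * len(rewards)
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running   # v_k = r_k + gamma * v_{k+1}
        values[k] = running
    return values

# State 0 is visited at the first and third time step, but different rewards follow.
states = [0, 1, 0, 1]
rewards = [1.0, 0.0, 0.0, 2.0]
v = empirical_values(rewards, gamma=0.5)
```

Here the two visits to state 0 receive empirical values 1.25 and 1.0, so the same
state yields different targets at different times.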

Our goal is that for each state $s$ the estimate $V^t_s$ should be as close as possible
to the true expected future discounted reward $V_s$. Thus, for each state $s$ we would
like $V_s$ to be close to $v_k$ for all $k$ such that $s = s_k$. Furthermore, in non-stationary
environments we would like to discount old evidence by some parameter $\lambda \in (0, 1]$.
Formally, we want to minimise the loss function,

$$L := \frac{1}{2} \sum_{k=1}^{t} \lambda^{t-k} \left( v_k - V^t_{s_k} \right)^2. \tag{4}$$

For stationary environments we may simply set $\lambda = 1$ a priori.

As we wish to minimise this loss, we take the partial derivative with respect to
the value estimate of each state and set to zero,

$$\frac{\partial L}{\partial V^t_s} = -\sum_{k=1}^{t} \lambda^{t-k} \left( v_k - V^t_s \right) \delta_{s_k s} = V^t_s \sum_{k=1}^{t} \lambda^{t-k} \delta_{s_k s} - \sum_{k=1}^{t} \lambda^{t-k} \delta_{s_k s} v_k = 0,$$

where we could change $V^t_{s_k}$ into $V^t_s$ due to the presence of the Kronecker
$\delta_{s_k s}$, defined $\delta_{xy} := 1$ if $x = y$, and 0 otherwise. By defining a
discounted state visit counter $N^t_s := \sum_{k=1}^{t} \lambda^{t-k} \delta_{s_k s}$ we get

$$V^t_s N^t_s = \sum_{k=1}^{t} \lambda^{t-k} \delta_{s_k s} v_k. \tag{5}$$

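Since $v_k$ depends on future rewards $r_k$, Equation (5) cannot be used in its current
form. Next we note that $v_k$ has a self-consistency property with respect to the
rewards. Specifically, the tail of the future discounted reward sum for each state
depends on the empirical value at time $t$ in the following way,

$$v_k = \sum_{u=k}^{t-1} \gamma^{u-k} r_u + \gamma^{t-k} v_t.$$

Substituting this into Equation (5) and exchanging the order of the double sum,

$$\begin{aligned}
V^t_s N^t_s &= \sum_{u=1}^{t-1} \sum_{k=1}^{u} \lambda^{t-k} \delta_{s_k s} \gamma^{u-k} r_u + \sum_{k=1}^{t} (\lambda\gamma)^{t-k} \delta_{s_k s} v_t \\
&= \sum_{u=1}^{t-1} \lambda^{t-u} \sum_{k=1}^{u} (\lambda\gamma)^{u-k} \delta_{s_k s} r_u + \sum_{k=1}^{t} (\lambda\gamma)^{t-k} \delta_{s_k s} v_t \\
&= R^t_s + E^t_s v_t,
\end{aligned}$$

where $E^t_s := \sum_{k=1}^{t} (\lambda\gamma)^{t-k} \delta_{s_k s}$ is the eligibility trace of
state $s$, and $R^t_s := \sum_{u=1}^{t-1} \lambda^{t-u} E^u_s r_u$ is the discounted reward
with eligibility.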

$E^t_s$ and $R^t_s$ depend only on quantities known at time $t$. The only unknown
quantity is $v_t$, which we have to replace with our current estimate of this value at
time $t$, which is $V^t_{s_t}$. In other words, we bootstrap our estimates. This gives us,

$$V^t_s N^t_s = R^t_s + E^t_s V^t_{s_t}. \tag{6}$$

For state $s = s_t$, this simplifies to $V^t_{s_t} = R^t_{s_t} / (N^t_{s_t} - E^t_{s_t})$.
Substituting this back into Equation (6) we obtain,

$$V^t_s N^t_s = R^t_s + E^t_s \frac{R^t_{s_t}}{N^t_{s_t} - E^t_{s_t}}. \tag{7}$$

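This gives us an explicit expression for our $V$ estimates. However, from an
algorithmic perspective an incremental update rule is more convenient. To derive this we
make use of the relations,

$$N^{t+1}_s = \lambda N^t_s + \delta_{s_{t+1} s}, \qquad E^{t+1}_s = \lambda\gamma E^t_s + \delta_{s_{t+1} s}, \qquad R^{t+1}_s = \lambda R^t_s + \lambda E^t_s r_t,$$

with $N^0_s = E^0_s = R^0_s = 0$. Inserting these into Equation (7) with $t$ replaced by $t+1$,

$$\begin{aligned}
V^{t+1}_s N^{t+1}_s &= R^{t+1}_s + E^{t+1}_s \frac{R^{t+1}_{s_{t+1}}}{N^{t+1}_{s_{t+1}} - E^{t+1}_{s_{t+1}}} \\
&= \lambda R^t_s + \lambda E^t_s r_t + E^{t+1}_s \frac{R^t_{s_{t+1}} + E^t_{s_{t+1}} r_t}{N^t_{s_{t+1}} - \gamma E^t_{s_{t+1}}}.
\end{aligned}$$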
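
The three incremental relations for the visit counter, the eligibility trace and the
discounted reward with eligibility can be checked numerically against their defining
sums. The sketch below is such a sanity check on made-up data; all function names and
the toy trajectory are illustrative assumptions.

```python
import math

def visit(states, k, s):
    """Kronecker delta: 1 if state s was visited at (1-indexed) time k, else 0."""
    return 1.0 if states[k - 1] == s else 0.0

def direct_sums(states, rewards, s, t, gamma, lam):
    """N, E and R for state s at time t, computed from their defining sums."""
    def trace(u):
        return sum((lam * gamma) ** (u - k) * visit(states, k, s) for k in range(1, u + 1))
    N = sum(lam ** (t - k) * visit(states, k, s) for k in range(1, t + 1))
    R = sum(lam ** (t - u) * trace(u) * rewards[u - 1] for u in range(1, t))
    return N, trace(t), R

def recursive_sums(states, rewards, s, t, gamma, lam):
    """The same quantities built up step by step from N = E = R = 0."""
    N = E = R = 0.0
    for step in range(1, t + 1):
        if step > 1:
            # R uses the trace and the reward of the previous time step.
            R = lam * R + lam * E * rewards[step - 2]
        N = lam * N + visit(states, step, s)
        E = lam * gamma * E + visit(states, step, s)
    return N, E, R

states = [0, 1, 0, 0, 2, 1, 0]
rewards = [1.0, -0.5, 2.0, 0.0, 0.3, 1.5, -1.0]
agree = all(
    math.isclose(x, y, abs_tol=1e-12)
    for s in range(3) for t in range(1, len(states) + 1)
    for x, y in zip(direct_sums(states, rewards, s, t, 0.9, 0.8),
                    recursive_sums(states, rewards, s, t, 0.9, 0.8))
)
```

The recursive form needs only constant work per state and time step, which is what
makes an incremental update rule practical.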

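By solving Equation (6) for $R^t_s$ and substituting back in,

$$\begin{aligned}
V^{t+1}_s N^{t+1}_s &= \lambda \left( V^t_s N^t_s - E^t_s V^t_{s_t} \right) + \lambda E^t_s r_t + E^{t+1}_s \frac{N^t_{s_{t+1}} V^t_{s_{t+1}} - E^t_{s_{t+1}} V^t_{s_t} + E^t_{s_{t+1}} r_t}{N^t_{s_{t+1}} - \gamma E^t_{s_{t+1}}} \\
&= \left( \lambda N^t_s + \delta_{s_{t+1} s} \right) V^t_s - \delta_{s_{t+1} s} V^t_s - \lambda E^t_s V^t_{s_t} + \lambda E^t_s r_t \\
&\quad + E^{t+1}_s \frac{N^t_{s_{t+1}} V^t_{s_{t+1}} - E^t_{s_{t+1}} V^t_{s_t} + E^t_{s_{t+1}} r_t}{N^t_{s_{t+1}} - \gamma E^t_{s_{t+1}}}.
\end{aligned}$$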

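Dividing through by $N^{t+1}_s \; (= \lambda N^t_s + \delta_{s_{t+1} s})$,

$$V^{t+1}_s = V^t_s + \frac{-\delta_{s_{t+1} s} V^t_s - \lambda E^t_s V^t_{s_t} + \lambda E^t_s r_t}{\lambda N^t_s + \delta_{s_{t+1} s}} + \frac{\left( \lambda\gamma E^t_s + \delta_{s_{t+1} s} \right) \left( N^t_{s_{t+1}} V^t_{s_{t+1}} - E^t_{s_{t+1}} V^t_{s_t} + E^t_{s_{t+1}} r_t \right)}{\left( N^t_{s_{t+1}} - \gamma E^t_{s_{t+1}} \right) \left( \lambda N^t_s + \delta_{s_{t+1} s} \right)}.$$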

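Making the first denominator the same as the second, then expanding the
numerator,

$$\begin{aligned}
V^{t+1}_s = V^t_s &+ \frac{\lambda E^t_s r_t N^t_{s_{t+1}} - \lambda E^t_s V^t_{s_t} N^t_{s_{t+1}} - \delta_{s_{t+1} s} V^t_s N^t_{s_{t+1}} - \lambda\gamma E^t_{s_{t+1}} E^t_s r_t}{\left( N^t_{s_{t+1}} - \gamma E^t_{s_{t+1}} \right) \left( \lambda N^t_s + \delta_{s_{t+1} s} \right)} \\
&+ \frac{\lambda\gamma E^t_{s_{t+1}} E^t_s V^t_{s_t} + \gamma E^t_{s_{t+1}} V^t_s \delta_{s_{t+1} s} + \lambda\gamma E^t_s N^t_{s_{t+1}} V^t_{s_{t+1}} - \lambda\gamma E^t_s E^t_{s_{t+1}} V^t_{s_t}}{\left( N^t_{s_{t+1}} - \gamma E^t_{s_{t+1}} \right) \left( \lambda N^t_s + \delta_{s_{t+1} s} \right)} \\
&+ \frac{\lambda\gamma E^t_s E^t_{s_{t+1}} r_t + \delta_{s_{t+1} s} N^t_{s_{t+1}} V^t_{s_{t+1}} - \delta_{s_{t+1} s} E^t_{s_{t+1}} V^t_{s_t} + \delta_{s_{t+1} s} E^t_{s_{t+1}} r_t}{\left( N^t_{s_{t+1}} - \gamma E^t_{s_{t+1}} \right) \left( \lambda N^t_s + \delta_{s_{t+1} s} \right)}.
\end{aligned}$$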

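After cancelling equal terms (keeping in mind that in every term with a Kronecker
$\delta_{xy}$ factor we may assume that $x = y$ as the term is always zero otherwise), and
factoring out $E^t_s$ we obtain,

$$V^{t+1}_s = V^t_s + \frac{E^t_s \left( \lambda r_t N^t_{s_{t+1}} - \lambda V^t_{s_t} N^t_{s_{t+1}} + \gamma \delta_{s_{t+1} s} V^t_{s_{t+1}} + \lambda\gamma N^t_{s_{t+1}} V^t_{s_{t+1}} - \delta_{s_{t+1} s} V^t_{s_t} + \delta_{s_{t+1} s} r_t \right)}{\left( N^t_{s_{t+1}} - \gamma E^t_{s_{t+1}} \right) \left( \lambda N^t_s + \delta_{s_{t+1} s} \right)}.$$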

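Finally, by factoring out $\lambda N^t_{s_{t+1}} + \delta_{s_{t+1} s}$ we obtain our update rule,

$$V^{t+1}_s = V^t_s + E^t_s \, \beta_t(s, s_{t+1}) \left( r_t + \gamma V^t_{s_{t+1}} - V^t_{s_t} \right), \tag{8}$$

where the learning rate is given by,

$$\beta_t(s, s_{t+1}) := \frac{1}{N^t_{s_{t+1}} - \gamma E^t_{s_{t+1}}} \, \frac{N^t_{s_{t+1}}}{N^t_s}. \tag{9}$$

The second factor arises from the ratio
$(\lambda N^t_{s_{t+1}} + \delta_{s_{t+1} s}) / (\lambda N^t_s + \delta_{s_{t+1} s})$, which equals
$N^t_{s_{t+1}} / N^t_s$ in both cases: if $\delta_{s_{t+1} s} = 1$ then $s = s_{t+1}$ and both
ratios are 1, and otherwise the factors of λ cancel.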
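
A minimal tabular sketch of the update rule in Equations (8) and (9) follows. It is an
illustration rather than the authors' implementation. Two points are assumptions: the
counters are initialised to 1 (as Algorithm 1 later in the paper does), which keeps
the denominators non-zero, and the order of the increment and decay steps mirrors that
algorithm. The name `hl_lambda_update` is hypothetical.

```python
import numpy as np

def hl_lambda_update(V, E, N, s, r, s_next, gamma, lam):
    """One HL(lambda) step for the transition s -> s_next with reward r (in place)."""
    td_error = r + gamma * V[s_next] - V[s]
    E[s] += 1.0                       # E^t and N^t include the visit to s at time t
    N[s] += 1.0
    # Equation (9): one learning rate per state, all sharing the same successor.
    beta = (1.0 / (N[s_next] - gamma * E[s_next])) * (N[s_next] / N)
    V += E * beta * td_error          # Equation (8)
    E *= gamma * lam                  # decay towards time t + 1
    N *= lam
    return beta

n_states = 3
V = np.zeros(n_states)
E = np.zeros(n_states)
N = np.ones(n_states)   # assumption: start the visit counters at 1
beta = hl_lambda_update(V, E, N, s=0, r=1.0, s_next=1, gamma=0.9, lam=1.0)
```

No learning rate is passed in: the step size for each state comes from the visit
counters and the trace of the successor state.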



Examining Equation (8), we find the usual update equation for temporal difference
learning with eligibility traces (see Equation (2)); however, the learning rate α has
now been replaced by $\beta_t(s, s_{t+1})$. This learning rate was derived from statistical
principles by minimising the squared loss between the estimated and true state value.
In the derivation we have exploited the fact that the latter must be self-consistent
and then bootstrapped to get Equation (6). This gives us an equation for the
learning rate for each state transition at time $t$, as opposed to the standard temporal
difference learning where the learning rate α is either a fixed free parameter for all
transitions, or is decreased over time by some monotonically decreasing function. In
either case, the learning rate is not automatic and must be experimentally tuned for
good performance. The above derivation appears to theoretically solve this problem.

The first term in $\beta_t$ seems to provide some type of normalisation to the learning
rate, though the intuition behind this is not clear to us. The meaning of the second term,
however, can be understood as follows: $N^t_s$ measures how often we have visited state
$s$ in the recent past. Therefore, if $N^t_s \ll N^t_{s_{t+1}}$ then state $s$ has a value
estimate based on relatively few samples, while state $s_{t+1}$ has a value estimate based
on relatively many samples. In such a situation, the second term in $\beta_t$ boosts the
learning rate so that $V^{t+1}_s$ moves more aggressively towards the presumably more
accurate $r_t + \gamma V^t_{s_{t+1}}$. In the opposite situation when $s_{t+1}$ is a less
visited state, we see that the reverse occurs and the learning rate is reduced in order to
maintain the existing value of $V_s$.
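
As a worked illustration of the learning rate in Equation (9), consider two
hypothetical situations with $\gamma = 0.9$ and $E^t_{s_{t+1}} = 1$. The numbers are
chosen only to show the direction of the effect and do not come from the experiments.

```latex
\begin{align*}
\text{rarely visited } s:\quad
  & N^t_s = 2,\ N^t_{s_{t+1}} = 20
  \;\Rightarrow\;
  \beta_t = \frac{1}{20 - 0.9} \cdot \frac{20}{2} \approx 0.524, \\
\text{often visited } s:\quad
  & N^t_s = 20,\ N^t_{s_{t+1}} = 2
  \;\Rightarrow\;
  \beta_t = \frac{1}{2 - 0.9} \cdot \frac{2}{20} \approx 0.091.
\end{align*}
```

The ratio of visit counters is 10 in the first case and 0.1 in the second, so the
estimate of the rarely visited state is moved much further towards its target.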

3 A simple Markov process 

For our first test we consider a simple Markov process with 51 states. In each step
the state number is either incremented or decremented by one with equal probability,
unless the system is in state 0 or 50 in which case it always transitions to state 25
in the following step. When the state transitions from 0 to 25 a reward of 1.0 is
generated, and for a transition from 50 to 25 a reward of -1.0 is generated. All
other transitions have a reward of 0. We set the discount value γ = 0.99 and then
computed the true discounted value of each state by running a brute force Monte
Carlo simulation.
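
The chain described above is small enough that its true values can also be obtained
exactly. The sketch below builds the transition matrix and solves the linear system
implied by Equation (1). This is a swapped-in alternative to the brute force Monte
Carlo simulation used in the paper, shown only for illustration; the function names are
assumptions.

```python
import numpy as np

def simple_chain(n_states=51):
    """Transition matrix P and expected one-step reward vector for the chain."""
    mid, last = n_states // 2, n_states - 1
    P = np.zeros((n_states, n_states))
    reward = np.zeros(n_states)
    for s in range(1, last):
        P[s, s - 1] = P[s, s + 1] = 0.5   # random walk in the interior
    P[0, mid] = P[last, mid] = 1.0        # both end states jump to the middle
    reward[0], reward[last] = 1.0, -1.0   # rewards attached to those two jumps
    return P, reward

def true_values(P, reward, gamma):
    """Solve V = reward + gamma * P V, the matrix form of Equation (1)."""
    return np.linalg.solve(np.eye(len(reward)) - gamma * P, reward)

P, reward = simple_chain()
V_true = true_values(P, reward, gamma=0.99)
```

By symmetry the middle state has value 0 and the values of states $i$ and $50 - i$
differ only in sign, which gives a quick check on any estimate.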

We ran our algorithm 10 times on the above Markov chain and computed the
root mean squared error in the value estimate across the states at each time step
averaged across each run. The optimal value of λ for HL(λ) was 1.0, which was
to be expected given that the environment is stationary and thus discounting old
experience is not helpful.

For TD(λ) we tried various different learning rates and values of λ. We could
find no settings where TD(λ) was competitive with HL(λ). If the learning rate α
was set too high the system would learn as fast as HL(λ) briefly before becoming
stuck. With a lower learning rate the final performance was improved, however the
initial performance was now much worse than HL(λ). The results of these tests
appear in Figure 1.

Figure 1: 51 state Markov process averaged over 10 runs. The parameter a is
the learning rate α.

Figure 2: 51 state Markov process averaged over 300 runs.

Similar tests were performed with larger and smaller Markov chains, and with
different values of γ. HL(λ) was consistently superior to TD(λ) across these tests.
One wonders whether this may be due to the fact that the implicit learning rate
that HL(λ) uses is not fixed. To test this we explored the performance of a number
of different learning rate functions on the 51 state Markov chain described above.
We found that functions of the form [formula missing] always performed poorly, however
good performance was possible by setting κ correctly for functions of the two forms
[formulas missing]. As the results were much closer, we averaged over 300 runs. These
results appear in Figure 2.

With a variable learning rate TD(λ) performs much better; however, we
were still unable to find an equation that reduced the learning rate in such a way
that TD(λ) would outperform HL(λ). This is evidence that HL(λ) is adapting the
learning rate optimally without the need for manual equation tuning.

4 Random Markov process 

To test on a Markov process with a more complex transition structure, we created
a random 50 state Markov process. We did this by creating a 50 by 50 transition
matrix where each element was set to 0 with probability 0.9, and a uniformly random
number in the interval [0, 1] otherwise. We then scaled each row to sum to 1. Then to
transition between states we interpreted the $i^{\text{th}}$ row as a probability
distribution over which state follows state $i$. To compute the reward associated with
each transition we created a random matrix as above, but without normalising. We set
γ = 0.9 and then ran a brute force Monte Carlo simulation to compute the true discounted
value of each state.

Figure 3: Random 50 state Markov process. The parameter a is the learning rate α.

Figure 4: 21 state non-stationary Markov process.
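
The construction of the random process can be sketched as below. The text does not say
what happens if a row of the transition matrix comes out entirely zero, so redrawing
such rows is an assumption made here; the function names and the seed are also
illustrative.

```python
import numpy as np

def sparse_random_matrix(n, rng, p_zero=0.9):
    """n x n matrix: each entry is 0 with probability p_zero, else uniform on [0, 1]."""
    values = rng.uniform(0.0, 1.0, size=(n, n))
    keep = rng.uniform(size=(n, n)) >= p_zero
    return values * keep

def random_markov_process(n=50, seed=0):
    """Return a row-stochastic transition matrix P and a reward matrix R."""
    rng = np.random.default_rng(seed)
    P = sparse_random_matrix(n, rng)
    # Assumption: rows with no successor at all are redrawn until none remain.
    while (P.sum(axis=1) == 0).any():
        empty = P.sum(axis=1) == 0
        P[empty] = sparse_random_matrix(n, rng)[empty]
    P /= P.sum(axis=1, keepdims=True)   # scale each row to sum to 1
    R = sparse_random_matrix(n, rng)    # rewards: same recipe, not normalised
    return P, R

P, R = random_markov_process()
```

With 90% of the entries zero, each state has about five possible successors on
average, which gives a much less regular structure than the chain of Section 3.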

The λ parameter for HL(λ) was simply set to 1.0 as the environment is stationary.
For TD we experimented with a range of parameter settings and learning rate
decrease functions. We found that a fixed learning rate of α = 0.2, and a decreasing
rate of [formula missing] performed reasonably well, but never as well as HL(λ). The
results were generated by averaging over 10 runs, and are shown in Figure 3.

Although the structure of this Markov process is quite different to that used in
the previous experiment, the results are again similar: HL(λ) performs as well or
better than TD(λ) from the beginning to the end of the run. Furthermore, stability
in the error towards the end of the run is better with HL(λ) and no manual learning
rate tuning was required for these performance gains.



5 Non-stationary Markov process 

The λ parameter in HL(λ), introduced in Equation (4), reduces the importance of
old observations when computing the state value estimates. When the environment
is stationary this is not useful and so we can set λ = 1.0, however in a non-stationary
environment we need to reduce this value so that the state values adapt properly
to changes in the environment. The more rapidly the environment is changing, the
lower we need to make λ in order to more rapidly forget old observations.

To test HL(λ) in such a setting, we used the Markov chain from Section 3, but
reduced its size to 21 states to speed up convergence. We used this Markov chain for
the first 5,000 time steps. At that point, we changed the reward when transitioning
from the last state to the middle state from -1.0 to 0.5. At time 10,000 we then
switched back to the original Markov chain, and so on alternating between the
models of the environment every 5,000 steps. At each switch, we also changed the
target state values that the algorithm was trying to estimate to match the current
configuration of the environment. For this experiment we set γ = 0.9.

As expected, the optimal value of λ for HL(λ) fell from 1 down to about 0.9995.
This is about what we would expect given that each phase is 5,000 steps long. For
TD(λ) the optimal value of λ was around 0.8 and the optimum learning rate was
around 0.05. As we would expect, for both algorithms when we pushed λ above its
optimal value this caused poor performance in the periods following each switch in
the environment (these bad parameter settings are not shown in the results). On the
other hand, setting λ too low produced initially fast adaptation to each environment
switch, but poor performance after that until the next environment change. To get
accurate statistics we averaged over 200 runs. The results of these tests appear in
Figure 4.
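
One way to read the value λ ≈ 0.9995 is through the total weight that the loss in
Equation (4) places on past observations. The arithmetic below is an interpretation
offered for illustration, not a calculation taken from the paper.

```latex
\begin{align*}
\sum_{j=0}^{\infty} \lambda^{j} &= \frac{1}{1 - \lambda} = \frac{1}{1 - 0.9995} = 2000, \\
\lambda^{5000} &= 0.9995^{5000} \approx e^{-2.5} \approx 0.08.
\end{align*}
```

The effective memory is therefore about 2,000 steps, and an observation from a full
phase earlier keeps less than a tenth of its original weight, which fits a phase
length of 5,000 steps.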

For some reason HL(0.9995) learns faster than TD(0.8) in the first half of the
first cycle, but only equally fast at the start of each following cycle. We are not sure
why this is happening. We could improve the initial speed at which HL(λ) learnt
in the last three cycles by reducing λ, however that comes at a performance cost in
terms of the lowest mean squared error attained at the end of each cycle. In any
case, in this non-stationary situation HL(λ) again performed well.

6 Windy Gridworld 

Reinforcement learning algorithms such as Watkins' Q(λ) [Wat89] and Sarsa(λ)
[RN94, Rum95] are based on temporal difference updates. This suggests that new
reinforcement learning algorithms based on HL(λ) should be possible.

For our first experiment we took the standard Sarsa(λ) algorithm and modified
it in the obvious way to use an HL temporal difference update. In the presentation
of this algorithm we have changed notation slightly to make things more
consistent with that typical in reinforcement learning. Specifically, we have dropped the
$t$ superscript as this is implicit in the algorithm specification, and have defined
$Q(s,a) := V_{(s,a)}$, $E(s,a) := E_{(s,a)}$ and $N(s,a) := N_{(s,a)}$. Our new reinforcement
learning algorithm, which we call HLS(λ), is given in Algorithm 1. Essentially the
only changes to the standard Sarsa(λ) algorithm have been to add code to compute
the visit counter $N(s,a)$, add a loop to compute the β values, and replace α with β
in the temporal difference update.
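
The steps of Algorithm 1 can be sketched in Python as follows. This is an
illustrative reading of the algorithm, not the authors' code: states and actions are
assumed to be integer indices, the environment is assumed to be a function
`step(s, a)` returning `(r, s2)`, and all function names are hypothetical.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def hls_update(Q, E, N, s, a, r, s2, a2, gamma, lam):
    """One HLS(lambda) update for (s, a) -> (s2, a2); Q, E and N change in place."""
    delta = r + gamma * Q[s2, a2] - Q[s, a]
    E[s, a] += 1.0
    N[s, a] += 1.0
    # Learning rate for every state-action pair, relative to the successor pair.
    beta = (1.0 / (N[s2, a2] - gamma * E[s2, a2])) * (N[s2, a2] / N)
    Q += beta * E * delta
    E *= gamma * lam
    N *= lam

def run_hls(step, n_states, n_actions, start, n_steps, gamma, lam, epsilon, seed=0):
    """Run HLS(lambda) on a continuing task and return the learned Q table."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    E = np.zeros((n_states, n_actions))
    N = np.ones((n_states, n_actions))    # counters start at 1, as in Algorithm 1
    s = start
    a = epsilon_greedy(Q, s, epsilon, rng)
    for _ in range(n_steps):
        r, s2 = step(s, a)
        a2 = epsilon_greedy(Q, s2, epsilon, rng)
        hls_update(Q, E, N, s, a, r, s2, a2, gamma, lam)
        s, a = s2, a2
    return Q

# Hypothetical two-state toy task: action 1 is rewarded, the state simply alternates.
toy_step = lambda s, a: (1.0 if a == 1 else 0.0, (s + 1) % 2)
Q = run_hls(toy_step, n_states=2, n_actions=2, start=0, n_steps=500,
            gamma=0.9, lam=1.0, epsilon=0.1)
```

Compared with a Sarsa(λ) implementation, the only extra state is the array `N`, and
the only extra work per step is computing `beta`.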

Algorithm 1 HLS(λ)

    Initialise Q(s, a) = 0, N(s, a) = 1 and E(s, a) = 0 for all s, a
    Initialise s and a
    repeat
        Take action a, observe r, s'
        Choose a' by using ε-greedy selection on Q(s', ·)
        Δ ← r + γ Q(s', a') - Q(s, a)
        E(s, a) ← E(s, a) + 1
        N(s, a) ← N(s, a) + 1
        for all s, a do
            β((s, a), (s', a')) ← 1 / (N(s', a') - γ E(s', a')) · N(s', a') / N(s, a)
        end for
        for all s, a do
            Q(s, a) ← Q(s, a) + β((s, a), (s', a')) E(s, a) Δ
            E(s, a) ← γλ E(s, a)
            N(s, a) ← λ N(s, a)
        end for
        s ← s'; a ← a'
    until end of run

To test HLS(λ) against standard Sarsa(λ) we used the Windy Gridworld
environment described on page 146 of [SB98]. This world is a grid of 7 by 10 squares
that the agent can move through by going either up, down, left or right. If the agent
attempts to move off the grid it simply stays where it is. The agent starts in the
4th row of the 1st column and receives a reward of 1 when it finds its way to the 4th
row of the 8th column. To make things more difficult, there is a "wind" blowing the
agent up 1 row in columns 4, 5, 6, and 9, and a strong wind of 2 in columns 7 and
8. This is illustrated in Figure 5. Unlike in the original version, we have set up this
problem to be a continuing discounted task with an automatic transition from the
goal state back to the start state.
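
A sketch of the continuing Windy Gridworld described above is given below. Several
details are assumptions made for illustration: rows are indexed from the top, the wind
is taken from the column the agent leaves, moves that would leave the grid are clipped
at the edge, and the start is taken to be the 4th row of the 1st column as in the
standard version of the problem.

```python
N_ROWS, N_COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]     # upward push for columns 1 to 10
START, GOAL = (3, 0), (3, 7)              # 0-indexed (row, column)
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # up, down, left, right

def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))

def windy_step(state, action):
    """Return (reward, next_state) for one step of the continuing task."""
    if state == GOAL:
        return 1.0, START                 # automatic jump back to the start
    row, col = state
    d_row, d_col = MOVES[action]
    new_row = clamp(row + d_row - WIND[col], 0, N_ROWS - 1)   # wind pushes upwards
    new_col = clamp(col + d_col, 0, N_COLS - 1)
    return 0.0, (new_row, new_col)

reward, state = windy_step((3, 6), 3)     # move right out of a strong-wind column
```

Because the reward is attached to the jump from the goal back to the start, the task
never terminates and the discounted return is well defined at every time step.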

We set γ = 0.99 and in each run computed the empirical future discounted
reward at each point in time. As this value oscillated we also ran a moving average
through these values with a window length of 50. Each run lasted for 50,000 time
steps as this allowed us to see at what level each learning algorithm topped out.
These results appear in Figure 6 and were averaged over 500 runs to get accurate
statistics.

Despite putting considerable effort into tuning the parameters of Sarsa(λ), we
were unable to achieve a final future discounted reward above 5.0. The settings
shown on the graph represent the best final value we could achieve. In comparison
HLS(λ) easily beat this result at the end of the run, while being slightly slower
than Sarsa(λ) at the start. By setting λ = 0.99 we were able to achieve the same
performance as Sarsa(λ) at the start of the run, however the performance at the end
of the run was then only slightly better than Sarsa(λ). This combination of superior
performance and fewer parameters to tune suggests that the benefits of HL(λ) carry
over into the reinforcement learning setting.

Another popular reinforcement learning algorithm is Watkins' Q(λ). Similar to
Sarsa(λ) above, we simply inserted the HL(λ) temporal difference update into the
usual Q(λ) algorithm in the obvious way. We call this new algorithm HLQ(λ) (not
shown). The test environment was exactly the same as we used with Sarsa(λ) above.

Figure 5: [Windy Gridworld] S marks the start state and G the goal state, at which
the agent jumps back to S with a reward of 1.

Figure 6: Sarsa(λ) vs. HLS(λ) in the Windy Gridworld. Performance averaged
over 500 runs. On the graph, e represents the exploration parameter ε, and a the
learning rate α.

The results this time were more competitive (these results are not shown).
Nevertheless, despite spending a considerable amount of time fine tuning the parameters of
Q(λ), we were unable to beat HLQ(λ). As the performance advantage was relatively
modest, the main benefit of HLQ(λ) was that it achieved this level of performance
without having to tune a learning rate.



7 Conclusions 

We have derived a new equation for setting the learning rate in temporal difference
learning with eligibility traces. The equation replaces the free learning rate
parameter α, which is normally experimentally tuned by hand. In every setting tested, be it
stationary Markov chains, non-stationary Markov chains or reinforcement learning,
our new method produced superior results.

To further our theoretical understanding, the next step would be to try to prove
that the method converges to correct estimates. This can be done for TD(λ) under
certain assumptions on how the learning rate decreases over time. Hopefully,
something similar can be proven for our new method. In terms of experimental results,
it would be interesting to try different types of reinforcement learning problems and
to more clearly identify where the ability to set the learning rate differently for
different state transition pairs helps performance. It would also be good to generalise
the result to episodic tasks. Finally, just as we have successfully merged HL(λ) with
Sarsa(λ) and Watkins' Q(λ), we would also like to see if the same can be done with
Peng's Q(λ) [PW96], and perhaps other reinforcement learning algorithms.